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Abstract 

Tridiagonal linear systems of equations can be solved on conventional 
serial machines in a time proportional to N, where N is the number of 
equations. The conventional algorithms do not lend themselves directly to 
parallel computation on computers of the ILLIAC IV class, in the sense that 
they appear to be inherently serial. An efficient parallel algorithm is 
presented in which computation time grows as log^ N. The algorithm is 
based on recursive doubling solutions of linear recurrence relations, and 
can be used to solve recurrence relations of all orders. 
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I Introduction 

The trend in large-scale high-speed computers today clearly points to 
the use of internal parallelism to obtain significant increases in speed. 

For example, the ILLIAC IV computer can perform N simultaneous computations 
where N = 64*, 128, 256 , or 512 . We expect that highly efficient computations 
performed on a computer of the ILLIAC IV class will execute N times faster 
than on a serial computer of the same inherent speed. Actually, 
inefficiencies due to overhead and constraints on data communication among 
processors will reduce the speed increase to kN where k lies in the interval 
0 < k £ 1, Efficient algorithms have k near unity. 

Unfortunately, many parallel algorithms do not lend themselves to 
efficient parallel computation. We can exhibit examples of algorithms for 
which computation time decreases rather slowly as we increase the number of 
processors, and for some pathological examples the computation time is 
independent of the number of processors. An efficient parallel algorithm 
has the property that computation time decreases proportionally to 1/N as N, 
the parallelism factor, increases. 

In this paper we examine the solution of tridiagonal systems of linear 
equations. It is well known that such systems can be solved using a 
conventional serial computer in a time proportional to N where N is the 
number of equations. We present an algorithm for solving the equations in 
a time proportional to l°g 2 N by using a computer with N-fold parallelism. 
Computation in this case decreases as N) /N as N increases , which is 

greater than but very close to 1 /N, the desired rate of decrease. A different 
parallel algorithm for this problem which exhibits a similar time behavior has 
been developed by Buneman [1967] , and analyzed in the literature [Buzbee, et 
al., 1970]. 





- 2 - 

In Section II, we state the problem and indicate conventional serial 
methods for solution. These methods are inherently serial in that each 
computation depends on the result of the immediately preceding computation. 
In Section III we show how to perform a forward and backward sweep in 
logg N steps when given the LU decomposition of the original matrix. In 
Section IV we show how to obtain the LU decomposition in log^ N steps. 

This particular computation is of general interest because it is an 
efficient method for evaluating partial fraction expansions and linear 
difference equations in parallel. 

II Statement of the problem 

We wish to solve the tridiagonal system of equations 
A x = b 

where 



In the remainder of this paper we assume that N is a power of 2 , but 
this is not an essential assumption. 

There are a number of related methods for solving this system serially 
in a time proportional to N. The parallel algorithm presented here is 
based upon one such algorithm, the LU decomposition. [cf. Forsythe and 
Moler, 1967] In this algorithm we find two matrices, L, and U, such that 
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(!) LH=A 

(ii) is a lower bidiagonal matrix with l ! s on its principal diagonal, 

(iii) Ji is an upper bidiagonal matrix. 

When A is non-singular, its LU decomposition is unique. In fact, it 
is easily shown that 


U 1 f l 

U 2 f 2 

u 3 f 3 


U N-1 f N-l 


where f^, 1 ^ i £ N-l, is the upper diagonal of A, and 

u = d 
1 1 

e f 

u. = d. - i i-1. for i > 1. 

i. i 7 

Vi 


The lower bidiagonal matrix, ^L, is then given by 


Vi 1 


“N 1 


m 2 = e 2 / ' d l 


m. = 

x 


d . - f . 0 m . 

l-l i-2 l-l 


, for i > 2 


9 


for i £ 2 
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After computing L and U, it is relatively straight forward to solve 
the system of equations. The solution is a two-step process. 

Letting y = Ux, we have 

>V\ v ~ 

Ax=LUx = Ly=b 

V v *" (AV «,V- A* 

The equation L y = b is easily solved for y since 

y 1 = b ! (3> 

y. = b. - m.y. for 2 ^ i ^ N 

i li-l 

Then we solve U x = y for x. This equation is solved by a backward sweep 


*N = V U N 


x. = y . - x. n f . 
l l+l i 


Note that the recurrence formulae (l), (2), ( 3 ) and (^) constitute a 

complete algorithm for the solution of A x = b. Since each computation in 

v/v - 

this algorithm depends on the results of the previous computation, the 
algorithm is satisfactory for serial computation but quite unsatisfactory 
for parallel computation. In the following sections we derive equivalent 


formulae that are well-suited for parallel computation. 


Ill Parallel evaluation of the forward and backward sweeps 

The model of a parallel processor that lies behind the development of 
these parallel algorithms is based upon the ILLIAC IV computer. In this 
computer there are N processors with independent memories, but only one 
instruction stream. All of the processors operate synchronously, executing 
the same instruction on N different operand pairs, where N can be 6k. 128, 
256 , or 512 . For added flexibility, there is a mask associated with each 
processor that enables or disables the processor. Hence, if a processor’s 
mask is on, the processor executes the current instruction, otherwise the 
processor remains idle. 
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Data can be communicated among the processors in one of two ways. 

One datum can be broadcast to all processors simultaneously, or a vector 
of N items can be shifted cyclically among the processors. As an example 
of the latter case, suppose that the vector b = (b_ , b^, b^ , . . . b) is stored 

vw ' ±' <' JSj 

with b^ in the i*'* 1 processor. Then the vector can be shifted j places 
cyclically so that b^ is routed to processor (i+j) mod N for all i. 

In this section, we shall show how to solve (3) by a technique called 
recursive doubling . The idea is to rewrite (3) so that y is a function 
of y^. Thus, in successive iterations we can compute y^, y^, y^, yg, etc., 
and y^ can be computed in log^ N iterations. Since (4) is of the same 
form as (3), the backward sweep can be done using the same algorithm, and 
it also requires log^ N iterations. 

To begin the derivation, we rewrite (3) in the form 


y i = b i 


y. = b. + (-m. )y. _ 
J ± i v 


(3’) 


This change is necessary because we shall make use of the associativity of 
addition. 

Substituting for y ^ in ( 3 *) we find 

y 2 ~ ( — ^2 ^ * ^1 

y 3 = b 3 + (- m 3 )- b 2 + (^ 3 )-(-“ 2 )- b i 


(5) 


y i =. s . b j . n .. <-\) 


j=l 0 k=j+l 


where a vacuous product of m^'s is interpreted as the constant 1. 

The last formula in (5) shows the explicit dependence of y^ on each of 
the coefficients of m and b. Our goal is to derive a recurrence in which 


y^ is a function of y^. To anticipate the answer, momentarily consider 
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what happens when all of the components of are equal to -1. In this 

case is the sum of the first i components of b. Then if , 

b ) is defined to be the sum of b. through b. . . we have 

j-i+l' J J-i+1' 

y 2i( h 2i’ h 21-l*‘’‘ ,h l) = y i( b 2i' b 2i-l' * * * ,b i+l^ + ^ 

y i( b i^ b i-l" # * # ^ b l^ 

Equation (6) holds for all i £ 1. This recurrence has the recursive 

doubling form that we seek, and therefore is the basis for a parallel 

algorithm. The recursive doubling relation above suggests that we look for 

a general solution in terms of functions Y . .., Y^ where each Y^ is a 

function of i components of and | m. We shall use the notation Y^(j) as an 

abbreviation for the more cumbersome notation 

Y. (b . . b . -.«•««« b . • it) 

i J J-l 9 j-i+1* J* J-l' J-i+1' 

That is> Y^(j) is a function of i consecutive components of and m, with 
th 

the j component being the highest component. 

The following theorem establishes the relation we desire. 

Theorem 1 : Let Y^(j) satisfy the recurrence relation 

Y i+ l(j) = Y 1 (j) + Y i (j-l)-(-m J ) for i, j s 1 ( 7 ) 

with the boundary conditions 
Y 1 (j) = b for j a 1 

Y i (j) = 0 for j ^ 0 

Y. (j) =0 for i <: 0 


(i) for s s 2, Y i (j) satisfies the recurrence relation 

, j 

Y i+S (j) = Y s(J) + Y i(j -S ) n (-\) ior i i 1, J i s. ( 8 ) 

k=j-s+l 


Then 



- 7 - 


(ii) 


(iii) 


j j 

Y (j) = S Y (k) n (-mV for i ^ j i 1 
1 k=l 1 s=k+l S 

for i ^ j ^ 1, Y. ( j ) = y where y. is the 
i J J 

unique solution of ( 3 ). 


(9) 

component of the 


Proof : 

To prove part (i), we use induction on s. 
Basis step, s = 2 . 

From ( 7 ) we have 


Y i +2 (3) = VjU) + Y 1+1 (J-l)-(-j) 

= Y 1 (j) + Y 1 (j-l)-(-m j ) + Y i (j-2)*(-m J ).(-m J _ 1 ) 


But using ( 7 ) again we also have 


Y 2 (j) = Y 1 (j) + Y 1 (j-l)»(-m j ) 
Hence, 


Y i+ 2 (J) = Y 2 (j) + Y.(j-2).(-m j ).(- mj _ 1 ) 
which is recurrence relation ( 8 ) with s = 2, This proves the basis step. 


Induction step . We assume that (8) hold for all s in the interval 


2 ^ s ^ n- 1 , and we show it holds for s = n. 


From the induction hypothesis we have 

Y i +n (J) - + Y i + i(J- n+1 ^ n (-%) 

k=j-n+2 

j 

= Y n -i( j) + Y i(J- n+1 )- n (-\) 

k=j-n+2 
j 

+ Y (j-n). n (- 
k=j-n+l 


But from the induction hypothesis it follows that 

j 

Y n (J) = Y n . 1 d) + Y 1 (j-n+l). n (-mj 

k=j-n+2 
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Hence, 


Y. +n (j) = Y (j) + Y (j-n). n (-m.) 

1+n n 1 k=j-n+l k 

which is the same recurrence as (8) with s replaced by n. This proves 

part (i). 

To prove part (ii), we use induction on i. 

Basis step. From the theorem hypothesis we have 
Y 2 (j) = Y 1 (j) + Y 1 (j-l)-(-m j ), for j £ 1 

Then applying the boundary condition Y^O) = 0, we obtain 
Y 2 (l) = Y 1 (l) 

Y 2 (2) = Y x (2) + Y 1 (l)*(-m 2 ) 

These equations satisfy ( 9 ), thus proving the basis step. 

Induction step : We assume that (9) holds for all i in the interval 

2 ^ i ^ n-1, and we prove that it holds for i = n. Using (8) we have 

Y n (j) = Y i(J) + 

Using the induction hypothesis to substitute for Y r ^(j-l) yields 

'j-1 J-l 1 

Y(j)= Y -,(j) + 2 Y (k) n (-m ) *(-m ) for 2 £ j £ n 

“ 1 )_ k=1 s=k+1 
j j 

= E Y..(k) [] (-m ) for 2 ^ j < n ( 1( 

k=l s=k+l S 


The interval 2 ^ j ^ n for which the equations above are valid arises from 
the application of the induction hypothesis to Y^ ^(j-l) for 1 ^ j-l ^ n-1. 
Since (10) has the same form as ( 9 ), it is only necessary to show the 
validity of (10) for j = 1 to complete the proof. From the theorem 
hypothesis, 

V 1 ) = V 1 ) + Y n-l(°) = V 1 ) 

Since the same result is obtained by setting j = 1 in (10), the interval 
in (10) may be changed to 1 £ j £ n. This proves part (ii) of the theorem. 
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Part (iii) is a direct consequence of the fact that with the boundary 
condition Y^(j) = b^, (10) is identical to the solution to ( 3 ). This 
completes the proof of the theorem. 

Corollary: 

j 

Y 2 i (j) = Y i (j) + Y ± ( j-i)* n (-n^) for i, j ^ 1 (ll) 

k-j-i+1 

Proof: Follows directly from part (i) of Theorem 1 by replacing s 

by i. 

The corollary of Theorem 1 provides the recursive doubling algorithm 
for the solution of (3). The product term in (ll) appears to be difficult 
to evaluate because the number of factors in the product doubles with each 
iteration. Fortunately, we can also use recursive doubling to compute the 
product term. 

Let ]VL(j) be defined to be 

j 

M (j) = n (-iO for j ^ i 

k=j-i+l 

j ( 12 ) 

= II ("“v) for j < i 

k=l k 

Then (ll) can be rewritten as 

Y 2 i ( J ) = Y i(j) + Y i (j-i)*M i (J) for i,j ^ 1 (13) 

The recursive doubling computation of M^j) is provided by the formula 

M 21 (J) = M i (j).M ± (j-i) for i,j ^ 1 (14) 

with the boundary conditions 

M (j) = -m. for j ^ 1 

1 3 

M i (j) = 1 for j £ 0 

M.(j) = 1 


for i ^ 0 
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The parallel algorithm for the solution of ( 3 ) is simply the iterative 
application of ( 13 ) and (l4) . It is given below in an ALGOL- like language. 
In the program, when an interval of the form (l ^ j < N) appears after a 
statement, that statement is assumed to be executed simultaneously for all 
indices in the interval. 

b egi n 

rgal array Y[l:N], M[2:N]; 

real array b[l:N], m[2:N]; 

comment Y and M are the arrays in which equations (13) and ( 14 ) 

are evaluated. Arrays b and m are the arrays that give the 
coefficients of (3)* These arrays may utilize the same 
storage space as the arrays Y and M, respectively; 

initialize: 

Y[j] : = b[j], (1 £ j £ N); 

M[j] : = (1 £ j £ N); 

for i : = 1 gtep i unty. N/2 do 

begin 

Y[j] : = Y[ j] + Y[ j-i] X M[j], ( i+1 < j <: N); 

M[j] : = M[ j] X (i+1 ^ j ^ N); 

end: 

At the completion of each iteration, the array Y contains Y i (j), and 
M contains IVL(j). From Theorem 1, Y^(j) = for 1 ^ j ^ N, so that Y^ is 
the solution to (3). Since i doubles during each iteration, log^ N 
iterations are required for the computation. The vector operations 
indicated in the program are easily carried out in an ILLIAC IV type of 
computer since masking operations can be used to establish the interval 
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for the index j, and cyclic shifting of components of a vector can be used 
to align Y[ j] with Y[j-i]. The parallel algorithm is also suitable for 
efficient operation in vector processors of the pipeline class such as the 
CDC STAR computer. 

For the solution of the backward sweep, Equation ( 4 ), the body of the 
iteration should be modified as indicated below: 


begin 

Y[j]: = Y[j] + Y[ j+i] X M[j], (l £ j £ N - i); 

M[ j] : = M[j] x M[j+i], (l^j^N-i); 

end; 

IV Calculation of the LU decomposition by recursive doubling 

We now focus attention on the efficient calculation of (l) and (2). 

Again we use recursive doubling to compute the coefficients u = (u^u^, . . */ u N ) 
and m = (m^m^, . . ., 11 ^). The approach we use is to solve (l) by recursive 
doubling, then compute m i = e i / u i 1 simultaneously for 2 ^ i ^ N to solve (2). 

Since (l) is a partial fraction expansion, it is convenient to cast it 
into a linear form which is suitable for a recursive doubling algorithm. It 
is well known [cf. Wall, 19^8] that every partial fraction expansion is 
associated with a linear second order recurrence relation. In particular. 


if we define the quantities q^ 0 ^ i £ N, by the recurrence relation 


q i “ d i q i-l “ e i f i-l q i-2 


i £ 2 


with the boundary conditions 




for i s 1 


(IS) 
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or equivalently. 


q 


i 


i 

= n 

j=i 


To solve (l) efficiently, we have only to solve (15) efficiently, 
because after calculating q^, 0 ^ i ^ N, we can evaluate (l6) in a single 
operation carried out simultaneously on N processors. Equation (15) is 
somewhat more difficult to solve than (3) because it is of second order, 
whereas (3) is of first order. However, we can make use of an artifice to 
reformulate (15) as a matrix recurrence relation of first order. In 
particular, it follows from (15) that 


Vi 


d. 

1 


0 


4-1 


q i-2 


A. 

u>l 


4-1 


4-2 


Note that we can substitute A. .. (q. _ q. „) for (q. . q. „) above. 

—l-l v i-2 1-3 v i-l i-2' 

and can continue this substitution repeatedly until we obtain 


q i 


A. A . 1 • * • Aq 


*i-l 


(17) 


This formulation of the problem is ideal for recursive doubling. Since 

matrix multiplication is associative, we can evaluate the product 

^...A^ in exactly the same way that we evaluate a product of scalars. 

In fact, we have encountered this problem before in ( 12 ), and the recursive 

doubling solution is the schema of (lk) . Then to solve (15) for all q i 

t h 

simultaneously, requires log^ N iterations, in which the _i iteration 
N-i 

involves the 2 simultaneous calculations of the product of two 2X2 


matrices. 
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It is rather interesting to investigate the properties of the functions 
because it is possible to exploit their characteristics and obtain a 
parallel algorithm slightly more efficient than the solution to (17) 
described above. Fortunately, a great deal is known about these functions. 



Knuth [1971] attributes to Euler [17^8] the observation that 

contains the term d^d^ ^...d^, together with every term that can be 

constructed by replacing d.d. _ by -e .f . for all possible combinations of 

3 J-l 3 J-l 

such pairs. This property follows directly from the recurrence relation 


(15). The first product in (15), d^q^ creates terms in q^ for which 
adjacent d-pairs are deleted from among only the coefficients d^, d , . .., 
di ^ in all possible ways, and thus produces every possible way there can 
be terms containing d^. The second product in (15) replaces d^d^ ^ by 
” e i^i ancl conib i nes this with every possible way d-pairs can be eliminated 
among the coefficients d^, d^>, . .., d^ This produces every possible term 

without d. . 

1 

We can obtain factorizations of the q^ functions that correspond to the 
intermediate results in the evaluation of (17). To arrive at these 
factorizations, let us define Q^(j) for j ^ i to be the function q^ with the 
subscripts of its arguments increased systematically so that the leading 
subscript is j. For j < i, we define Q i (j) = Qj(j). Some examples of Q i (j) 
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should clarify ambiguities in the definition. 

Q x (l) = 

Q 2 (l) = 

= d 3 d 2 d r d 3 e 2 f r e 3 f 2 d i 

= dj^d^dg-d^e^f^-e^f^dg 

Q 3 (2) = Q 2 (2) = d 2 d 1 -e 2 f 1 

From this definition it now follows directly that the functions 
satisfy the recurrence 

W j) = V j)Q i (j “ s) " Vs+iVsVi^ii^ 8 - 11 (l8) 

for j ^ s, i 2: 1 

with the boundary conditions 

Q,( j ) = d for j 2 1 

*1* J 

Q-^j) = 1 for j 2 0, i <: 0 

(^(j) = 1 for j £ 0, i 2 0 

e 3+1 . f j = 0 for j <: 0 

This recurrence formulation is also well-known, with citations in the 
literature at least as early as 1853 . [Sylvester, 1853; Perron, 1913 ] • 

The validity of (l8) can be verified by an intuitive argument. To 

find all possible ways of eliminating adjacent d-pairs in a sequence of 

i+s coefficients, combine every possible way of eliminating pairs in the 

first s coefficients with every possible way of eliminating pairs in the 

last i coefficients. This accounts for the first term of (l8). However, 
one d pair contains the last coefficient from the set of s coefficients and 
the first coefficient from the set of i coefficients. The first term in 
(l8) does not account for any of the ways this pair can be eliminated. We 
see that the second term in (l8) accounts for all such ways, because 
e^. g replaces the pair and this replacement is combined with every 
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possible way of eliminating pairs in the first s-1 coefficients and in the 
last i-1 coefficients. From (l8) we obtain the recursive doubling formulae 

Theorem 2 ; Q i (j) satisfies the recurrence relations 

(19) 

Q 2i _ 2 (J) = Q i _ 1 (d)9 i _ 1 (J- i + 1 ) + 

Proof: These formulae follow directly from (l8). 

The first of the equations in the corollary above is a recursive 
doubling formula which shows that Q^ i depends on both Q ± and Hence, 

to compute we need to compute both and Q^ ± ^ . To compute ^ we 

have to compute Since ^±-2 ^ e P enc ^ s on the same quantities as 

and Q^ i we need only the three equations ( 19 ) in a recursive doubling 
algorithm. Since we have to compute and Qgi 2 an y wa ^^ it is slightly 

more efficient to compute Q by the f°rmula 

Q 2i (J) = d j^2i " l ^ -e j f j-1^2i-2^ -2 ^ ’ 

The complete algorithm to compute q , 1 ^ i ^ N is given below in an 
ALGOL-like language. The initial conditions establish the values of Q 
and Q^. The first iteration computes and the second iteration 

computes Qg, and Qg, and the last iteration computes and 


begin 


r ea l |rray e[ 2:N], F[l:N-l], dQ!:N], EF[l:N], 

TEMP[l:N], Ql[l:N], QIM1[0 : n], QIM2[-1:N]; 
comment the arrays hold the quantities indicated below. 

E The lower diagonal of the tridiagonal matrix A. 

F The upper diagonal of A. 

D The major diagonal of A. 


EF 


This holds products of the form -e.f . _ . 

x i-l 
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TEMP 

A temporary array . 

QI 

Holds Q i ( j ) . 

QIM1 

Holds Q._ 1 (j). 

QIM2 

Holds Q ± _ 2 (j). 


The computation begins by initializing EF, QI, QIM1, and QIM2; 
initialize : 

EF[i] := - E[i]XF[i-l], (2 < i <; n) ; EF[1] := 0; 

QIM2[i] := 1, (1 £ i £ N); 

QIMl[i] := D[i], (1 < 1 <■ N); QIM1[0] := 1; 

Ql[i] := D[i]XD[i-l] + EF[l], (2 <: i :£ N); 

QI[1] := D[ l] ; 

comment the last three lines initialize the arrays to Q^, Q^, and Q^, 
respectively ; 

for i := 2 step i until N/2 do 

i *■ - in vwv (A — ' 

begin 

TEMP[j] := QIMl[j]XQIMl[j-i+l] + EF[ j-i+2]XQIM2[ j]xQIM2[ j-i], 

(i-1 <; j £ N); 

comment TEMP contains It cannot be written over ^ y et 

since Q. _ is needed in the next line; 

i-2 

QIMl[j] := Ql[j]xQlMl[j-i] + EF[j-i+l]XQIMl[ j]xQIM2[ j-i-l], 

(i S j <: N); 

QIM2[j] := TEMP[j], (i-1 £ j £ N) ; 

QI[J] := D[j]xQIMl[j-l] + EF[j]xQIM2[j-2], (i+1 <= j £ N); 
end; 

At the termination of the algorithm, Ql[i] will contain for 
1 £ i £ N. We use (l 6) to compute the diagonal of U from the q.’s. This 

w*. i 

clearly can be done in parallel by dividing the vector QI by a shift of 
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itself. Finally, to compute the subdiagonal of JL, we note that (2) indicates 
that this computation can be done by one parallel division. 

In executing the algorithm on an ILLIAC IV class of computer, the 
vector alignment required for the calculation is done by cyclically shifting 
vectors among the processors. Since the algorithm requires that 
QIMl[j] = QIM2[j] = 1 for j ^ 0, we can avoid storing these quantities by 
changing the cyclic shift of these vectors to an end-off shift in which the 
integer 1 is shifted into each of these vectors. Similarly, 

EF[j] = 0 for j £ 1, so that 0’s are always shifted into EF[2] when the EF 
vector is aligned. 

The ranges indicated for each statement in the basic iteration show the 
positions of the vectors which change when that statement is executed. The 
algorithm will work correctly when all ranges are replaced by the full 
range 1 <, 1 <, N since values that do not change are recomputed at each step. 
It is somewhat more efficient to use the full range for a calculation than 
the ranges given, although redundant recomputation of values may be 
accompanied by greater round-off error. 

The serial solution of a tridiagonal system of equations, when done as 
outlined in Section II, requires 3(N-l) of each of the operations division, 
multiplication, and subtraction. The parallel computation has three loops, 
each executed log^ N times. The loop that computes the LU decomposition 
requires eight multiplications and three additions per iteration, whereas 
the forward and back substitutions each require two multiplication and one 
addition per iteration. Apart from the computations within loops, there 
are at least four divisions, two multiplications and one addition applied 
to N elements simultaneously. 
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Hence the operation count for the parallel algorithm (exclusive of 
overhead computations) is 

12 log^ N + 2 multiplications 
5 log^ N + 1 additions 
k divisions. 

The reduction in the number of divisions is particularly important for 
computers which take much longer to divide than to multiply. (On the 
ILLIAC IV computer division is approximately five times longer than 
multiplication) . 

At this writing the stability of the algorithm has not been thoroughly 
investigated. Clearly, the algorithm is unstable if any vanishes. 


Since q. = n u., q. vanishes if and only if one of the u. coefficients 
1 j=i J 1 

vanishes. However, if the matrix is diagonally dominant and non-singular, 
every u^ is bounded away from zero [Isaacson and Keller, 1 966]. 

We conjecture that the error bounds for the parallel algorithm are 
comparable to those of the serial algorithm. 


/ 



Summary and conclusions 
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The parallel algorithm for the solution of tridiagonal systems of 
linear equations really consists of two different algorithms. One 
algorithm is the parallel evaluation of first order difference equations 
of the form 

x. = b.x. , + c. 

1 11-I 1 


where the b^ and c^ are constants. 

The second algorithm solves second order equations of the form 


x i 


b. x. , + c.x. _ 

1 l-l 1 i -2 


( 20 ) 


Since partial fraction expansions are associated with second order 

difference equations, the second algorithm may also be used to compute 

partial fraction expansions. The form of the solution obviously general- 

th 

izes to linear recurrence relations of arbitrary m order, still requiring 
log 2 N iterations, where each iteration involves simultaneous multiplications 
of m x m matrices. 

It is well known that a straightforward serial evaluation of (20) can 
be unstable [Gautschi, 1967 ]* although it is not unstable when the 
coefficients are obtained from diagonally dominant matrices. The stability 
of the parallel algorithm in such cases has not been investigated, but it 
too is undoubtedly unstable. Since (20) can be solved by backward 
recursion when forward recursion is unstable, we expect that backward 
parallel recursion would also be stable. 
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